ActiveClean: Interactive Data Cleaning For Statistical Modeling

Authors

  • Sanjay Krishnan
  • Jiannan Wang
  • Eugene Wu
  • Michael J. Franklin
  • Kenneth Y. Goldberg
Abstract

Analysts often clean dirty data iteratively: cleaning some data, executing the analysis, and then cleaning more data based on the results. We explore the iterative cleaning process in the context of statistical model training, which is an increasingly popular form of data analytics. We propose ActiveClean, which allows for progressive and iterative cleaning in statistical modeling problems while preserving convergence guarantees. ActiveClean supports an important class of models called convex loss models (e.g., linear regression and SVMs), and prioritizes cleaning those records likely to affect the results. We evaluate ActiveClean on five real-world datasets (UCI Adult, UCI EEG, MNIST, IMDB, and Dollars For Docs) with both real and synthetic errors. The results show that our proposed optimizations can improve model accuracy by up to 2.5x for the same amount of data cleaned. Furthermore, for a fixed cleaning budget and on all real dirty datasets, ActiveClean returns more accurate models than uniform sampling and Active Learning.

Similar resources

ActiveClean: Interactive Data Cleaning While Learning Convex Loss Models

Data cleaning is often an important step to ensure that predictive models, such as regression and classification, are not affected by systematic errors such as inconsistent, out-of-date, or outlier data. Identifying dirty data is often a manual and iterative process, and can be challenging on large datasets. However, many data cleaning workflows can introduce subtle biases into the training pro...

Improving Data Quality by Leveraging Statistical Relational Learning

Digitally collected data suffers from many data quality issues, such as duplicate, incorrect, or incomplete data. A common approach for counteracting these issues is to formulate a set of data cleaning rules to identify and repair incorrect, duplicate and missing data. Data cleaning systems must be able to treat data quality rules holistically, to incorporate heterogeneous constraints within a s...

The State-of-the-Art in Predictive Visual Analytics

Predictive analytics embraces an extensive range of techniques including statistical modeling, machine learning, and data mining and is applied in business intelligence, public health, disaster management and response, and many other fields. To date, visualization has been broadly used to support tasks in the predictive analytics pipeline. Primary uses have been in data cleaning, exploratory an...

A comparative study of the cleaning effect of various ultrasonic cleaners on new, unused endodontic instruments

BACKGROUND AND AIM: This study was carried out to compare three different ultrasonic cleaner devices in the cleaning process of endodontic instruments by scanning electron microscope (SEM). METHODS: In this study, 120 unused brand new hand and rotary instruments were examined after removing from the sealed package. The instruments were randomly divided into six groups of 20 rotary or hand files...

Statistical Semantic and Clinician Confidence Analysis for Real-Time Clinical Progress Note Cleaning

Clinical progress notes serve as a record of both the narrative text as well as structured data about a patient’s health problems that could be used for analysis and decision support. Time and efficiency pressures, however, have ensured clinicians’ continued preference for unstructured text over entering data in forms when composing progress notes. At the same time, the automatic extraction of ...

Journal:
  • PVLDB

Volume 9

Publication date: 2016